Lossless Compression and Complexity of Chaotic Sequences 



O 

m 
(N 



Q 
U 



Nithin Nagaraj,' Mathew Shaji Kavalekalam, ' Arjun Venugopal T.,' and Nithin Krishnan' 

'Department of Electronics and Communications, Amrita School of Engineering, 
Amrita Vishwa Vidyapeetham, Amritapuri Campus, Clappana P.O., Kollam, Kerala - 690 525. 

We investigate the complexity of shiort symbolic sequences of cliaotic dynamical systems by using lossless 
compression algorithms. In particular, we study Non-Sequential Recursive Pair Substitution (NSRPS), a lossless 
compression algorithm first proposed by W. Ebeling et al. [Math. Biosc. 52, 1980] and Jimenez-Montaiio et 
al. |arXiv:cond-mat/0204134 2002]) which was subsequently shown to be optimal. NSPRS has also been 
used to estimate Entropy of written English (P. Grassberger |arXiv:physics/0207023 , 2002]). We propose a new 
measure of complexity - defined as the number of iterations of NSRPS required to transform the input sequence 
into a constant sequence. We test this measure on symbolic sequences of the Logistic map for various values of 
the bifurcation parameter. The proposed measure of complexity is easy to compute and is observed to be highly 
correlated with the Lyapunov exponent of the original non-linear time series, even for very short symbolic 
sequences (as short as 50 samples). Finally, we construct symbolic sequences from the Skew-Tent map which 
are incompressible by popular compression algorithms like WinZip, WinRAR and 7-Zip, but compressible by 
NSRPS. 



I. INTRODUCTION 

Measuring complexity of experimental time series is one 
of the important goals of mathematical modeling of natural 
phenomena. A measure of complexity gives an insight in to 
the phenomenon being studied. For example, in a study of 
population dynamics of the fruit-fly, a measure of complexity 
of the time series (population size of generations) will throw 
light on the persistence and stability of the population. If the 
complexity is low, then it is possible that the population is 
exhibiting a periodic behavior, i.e. fluctuating between a high 
population size and a low one alternately. Complexity also 
plays a very important role in determining whether a sequence 
is random or not in Cryptography applications. 

Different measures of complexity such as Lyapunov Ex- 
ponent, Kolmogorov complexity. Algorithmic complexity etc. 
are proposed in the literature While complexity has sev- 
eral facets. Shannon entropy |2] is one of the reliable indi- 
cators of 'compressibility' which can serve as a measure of 
complexity. It is given by the following expression: 

M 

H{X) = ~^^Pi\og2{pi) bits/symbol, (1) 

where X is the symbolic sequence with M symbols and pi 
is the probability of the i-th symbol for a block-size of one. 
Block-size refers to the number of input symbols taken to- 
gether to compute the probability mass function. 

Shannon entropy plays an important role in lossless data 
storage and communications. Shannon's Noiseless Source 
Coding theorem |2] provides an upper limit on the com- 
pression ratio achievable by lossless compression algorithms. 
This limit is given by the Shannon entropy. Numerous al- 
gorithms have been designed with the aim of achieving this 
limit. Huffman coding, Shannon-Fano coding. Arithmetic 
coding, Lempel-Ziv coding are a few examples of lossless 
compression algorithms which achieve the Shannon entropy 
limit for stochastic i.i.d sources (independent and identically 
distributed) IH^]. However, practical estimation of entropy 
of sources is non-trivial since most sources are not i.i.d but 



contain correlations (short or long-range). As a simple exam- 
ple, in the English language, the probability of the occurrence 
of the letter 'u' after the letter 'g' has occuiTed, is nearly one. 

In this paper, we are interested in measuring complexity of 
short symbolic sequences which are obtained from time series 
generated by chaotic non-linear dynamical systems (we have 
used the Logistic map in our study and we expect the results 
to hold for other systems as well). 

This paper is organized as follows. In the next section, we 
highlight the challenges in measuring an estimate of Shan- 
non entropy for short sequences. In section III, we intro- 
duce NSRPS and propose a new measure of complexity based 
on this algorithm. Subsequently, in section IV, we test the 
new measure on several (short) sequences from the Logistic 
map and compare the complexity with a uniformly distributed 
random sequence Ii 1 3.] . The complexity measure based on 
NSRPS is compared with Lyapunov Exponent. In section 
V, we construct chaotic sequences which are incompressible 
by popular lossless compression algorithms, but which can be 
compressed by NSRPS. We conclude in section VI indicating 
directions for future work. 



II. DRAWBACKS OF SHANNON ENTROPY AS A 
COMPLEXITY MEASURE 

Shannon entropy can serve as a good indicator for complex- 
ity, but estimation of entropy is not a trivial task. Determining 
Shannon entropy of experimental time series is particularly 
challenging owing to the following reasons: 

1 . Analytical determination of the entropy is not easy even 
for a simple model of the experimental time series. 

2. The time series typically consists of real numbers. In 
order to calculate the entropy, it has to be converted into 
a symbolic sequence. The choice of the partition has 
a very important role to play in the estimation of the 
entropy. Ebeling et al. |5] shows that depending on the 
choice of the partition, the results can vary widely. 
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3. Noise is inevitable in any experiment. Noise has the 
tendency to increase entropy. 

4. Length of the time series is another important factor in 
the accurate determination of entropy. Shannon entropy 
requires the estimation of the probabiUty mass function, 
which is difficuh to accurately estimate with a short 
time series. Biological time series such as population 
sizes are typically of very small lengths, around 50-100 
samples (since actual experiments are time consuming). 
Entropy estimation methods in literature require 1000 to 
10000 samples [5J. 

In order to overcome these drawbacks, researchers have 
used lossless compression algorithms in order to estimate 
complexity or entropy . Lempel-Ziv and its popular vari- 
ations are extensively used by several researchers to determine 
complexity of time series ([6] and references therein). As we 
shall demonstrate in section V, this is not always reliable for 
short sequences. 



A. Effect of length of time series on Entropy estimation 

Fig. [T] shows the effect of length of the time series on nu- 
merical computation of the Shannon entropy. For a data- 
length L ~ 200, as the bifurcation parameter of the Logistic 
map is varied from 3.5 to 4.0, we observe that the numeri- 
cally estimated Shannon entropy (equation (1)) is poorly cor- 
related with Lyapunov exponent with a correlation coefficient 
of -0.2682. When the data-length is increased to L = 5000, 
Shannon entropy comes close to the Lyapunov exponent with 
a correlation coefficient of 0.8934. Ebeling [5J demonstrates 
that for the Logistic map. Shannon entropy comes very close 
to the Lyapunov exponent as the block-size increases to 10 
and for large data-lengths (L > 1000). 
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FIG. 1: Numerically computed Shannon entropy vs. Lyapunov Ex- 
ponent A as the bifurcation parameter is varied. 8 bins □ were used 
for the numerical computation of Shannon entropy using equation 
(1) and data-length L = 200. For computation of A, equation (3) 
was used. The two graphs are poorly correlated as indicated by a 
correlation coefficient of -0.2682. 

"The 8 bins are uniformly spaced between to 1 and the input time series is 
converted into a symbolic sequence consisting of 8 symbols corresponding to 
these bins. 



III. NSRPS-BASED MEASURE OF COMPLEXITY 

In this section, we propose a new measure of complex- 
ity based on a lossless compression algorithm called Non- 
sequential Recursive Pair Substitution (NSRPS). NSRPS was 
first proposed by Ebeling et al. [9] and later improved by 
Jimenez-Montano et al. lUOll . It was subsequently shown to be 
optimal [11]. NSPRS has also been used to estimate Entropy 
of written English |7]. The algorithm is briefly described as 
follows. 

Let the original sequence be called X. At the first iteration, 
find which pair of symbols have maximum number of occur- 
rences and replace all its non-overlapping occurrences with 
a new symbol. For example, the input sequence '01101011' 
is transformed into '21221' since the pair '01' has maximum 
number of occurrences compared to other pairs ('00', '10' and 
'11'). In the second iteration, '21221' is transformed to '323' 
since '21' has maximum frequency. The algorithm proceeds 
in this fashion until the length of the string is 1 (at which stage 
there is no pair to substitute). In this example, in the third iter- 
ation, '323' is transformed into '43' and in the fourth iteration 
it is transformed into '5' and the algorithm stops. 

The following observations can be made about the algo- 
rithm: 

1. The algorithm always terminates for finite length se- 
quences. 

2. After each iteration, the length of the sequence reduces. 
The number of distinct symbols may or may not in- 
crease (if the input sequence is '0000', then it is trans- 
formed to '22' and then to '3'). 

3. The quantity 'entropy x length' may increase or de- 
crease across the iterations. 

4. Ultimately, the quantity 'entropy x length' has to go 
to zero since the length eventually reaches 1 at which 
point the entropy is (since there is now only one sym- 
bol, it occurs with probability 1). A faster way for 
this quantity to go to zero is when the sequence gets 
transformed to a constant sequence (which has only 
one distinct symbol and hence zero entropy). Let N 
be the number of iterations required for the quantity 
'entropy x length' to reach zero. N is always a pos- 
itive integer. The minimum value of N is zero (for the 
constant sequence) and maximum is i — 1 where L is 
the length of the sequence (for a sequence either with 
distinct symbols are with all pairs being distinct). 

5. The algorithm as described above is not reversible, i.e. 
the original symbolic sequence can't be restored by the 
sequence at subsequent iterations. In order to make the 
algorithm reversible, we have to maintain a record of 
the specific pair of symbols which was substituted at 
each iteration. The bits required to store this overhead 
information compensates for the reduction in the num- 
ber of bits needed to store the transformed sequence. 
For achieving the best lossless compression ratio, we 
stop at the iteration number at which the total number 
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of bits required to store the transformed sequence and 
the overhead is a minimum (and hopefully lesser than 
the size of the original sequence). 



A. Definition of tlie new complexity measure 

The number of iterations N for the quantity 'entropy x 
length' to approach zero by the NSRPS algorithm (as de- 
scribed above) is defined as our new complexity measure. N 
is an integer in the range fO, L — 1]. 

Jimenez-Montano llioll actually tracks the quantity 
'entropy x length' across the iterations of NSRPS. While 
this is important, our motivation to use as a complexity 
measure is the following. N actually represents the effort 
required by NSRPS algorithm to transform the input sequence 
into a constant sequence (having only one distinct symbol and 
hence zero entropy). A sequence which is highly redundant 
would naturally have a lower value of N. As an example, 
the sequences A = 01010101 and B = 01001110 have 
the same length (8) and the same entropy of 1 bits/symbol 
(block-size=l). However, sequence A requires only = 1 
iteration for the quantity entropy x length to reach zero, 
whereas B requires = 6 iterations ifTill . Clearly, B is more 
complex than A (A is periodic, B has no obvious pattern). 



IV. RESULTS AND DISCUSSION 

In this section, we shall evaluate the usefulness of the new 
complexity measure based on NSRPS described in the previ- 
ous section. To this end, we consider sequences arising from 
the Logistic map for various values of the bifurcation param- 
eter 'a'. We know that the complexity of the time series in- 
creases with 'a', with occasional dips owing to the presence 
of windows (attracting periodic orbits). 
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FIG. 2: Complexity of different sequences (L = 200): entropy x 
length vs. number of iterations. 

In Fig. 121 the quantity entropy x length is plotted along 
the Y-axis and iteration number along the X-axis. The length 
of all sequences is L = 200. The new complexity measure 
N is the iteration number when the graph hits X-axis. As 
it can be seen, different sequences have different values of 
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FIG. 3: NSRPS based complexity measure N{a) vs. Lyapunov Ex- 
ponent A(a) as the bifurcation parameter 'a is varied between 3.5 to 
4.0. We have used 8 bins for deriving the symbolic sequence from 
the time series. The data-length is L = 200. For computation of 
A(a) we have used equation (3). A(a) was scaled by a factor of 70. 
The two graphs are highly correlated as indicated by a correlation 
coefficient of 0.8832. Compare this with Fig.[T] 



N. As expected, the sequence with the highest complexity is 
the independent and uniformly distributed random sequence 
{rand{) in MATLAB). The order of complexity (from higher 
to lower) is random )~ a = 4.0 >- 3.9 >- 3.75 >- 3.83. There 
is an attracting periodic orbit (window) at a = 3.83 and this 
explains the lower value of N. 

Table 1 shows the effect of data-length and number of bins 
on the new measure N for the Logistic map. As we vary the 
bifurcation parameter 'a' between 3.5 to 4.0, we find that even 
for L = 50, the correlation coefficient (CC) of N{a) with the 
Lyapunov Exponent A (a) is quite good. The entropy H (cal- 
culated using equation (1)) is very poorly correlated with A. 
For 2 bins, even at L = 5000, we found the CC of H{a) and 
A(a) to be 0.3565. Compare this with Table 1: for L = 50 
and 2 bins, the CC is already 0.665 1 . This shows that the new 
measure is quite good for very short symbolic sequences. Fig- 
ure [3] shows the graphs of N{a) and A (a) (scaled by a factor 
of 70 for better visibility and ease of comparison). 



TABLE I: Effect of data-length and number of bins on the new mea- 
sure A'^ in comparison with Lyapunov Exponent (A). CC stands for 
correlation coefficient between N{a) and A(a) as the bifurcation pa- 
rameter 'a' of the Logistic map is varied from 3.5 to 4.0. 



L 


# of Bins 


CC 




2 


0.6651 


50 


4 


0.6654 




8 


0.7324 




2 


0.8352 


100 


4 


0.8149 




8 


0.8172 




2 


0.8870 


200 


4 


0.8648 




8 


0.8832 
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Lyapunov Exponent A is given by the equation: 

1 " 

A= lim - Vln(|/'(xO|). (2) 

n— >-oo Ti — ' 

For the Logistic map, we have used the following equation to 
estimate A; 

, 1000 

^(") = I55^El"(l«(l-2x,)|), (3) 

2—1 

where 'a' is the bifurcation parameter (3.5 to 4.0) and x\ is a 
randomly chosen initial condition in the interval (0,1). 

The number of bins determines the number of symbols for 
the initial sequence. As L and number of bins increase, the 
CC gets better and better. 



other popular compression algorithms expand (all these use 
some variation of Lempel-Ziv compression algorithm). This 
behaviour was observed for values of a between 0.5 and 0.7. 
Rigorous investigation of these interesting sequences needs to 
be performed. 

TABLE II: Chaotic sequences from the Skew-tent map subjected to 
lossless compression algorithms. All numbers are in bits. As it can 
be seen, only NSRPS manages to compress the sequence. 



Input size 


WinZip 


WinRAR 


7-Zip 


NSRPS 


1024 


1616 


1376 


1816 


912 



VI. CONCLUSIONS AND FUTURE WORK 



V. CHAOTIC SEQUENCES FROM SKEW-TENT MAP 

Complexity measures based on lossless data compression 
are not always accurate, especially for short data lengths, as 
we shall demonstrate. Consider the Skew-tent map [J2il : 

Xi+i = — if < a;^ < a, 
a 



1 — a 

Here 'a' can be any value in the interval [0.5,1). For a = 0.5, 
we have the well-known Tent map. 

Using the value a = 0.65, data length L ~ 1024 and using a 
random initial condition, we first obtain a chaotic time series. 
From this, we find the symbolic sequence with 2 bins. The 
first bin is [0, 0.5) corresponding to symbol '0' and the second 
bin [0.5,1) corresponding to symbol '1'. The symbols '0' and 
T are equally likely since the invariant distribution for the 
Skew-tent map is uniform fl2il . This implies that the Shannon 
entropy is 1 bits/symbol. 

For compression using NSRPS, the overhead information 
was taken in to account. TablelUshows the efficacy of NSRPS 
for compressing such chaotic sequences of short length while 



The new measure is able to correctly characterize the com- 
plexity of chaotic sequences as demonstrated for the Logistic 
map (for different values of the bifurcation parameter) and a 
uniformly distributed random sequence. This new measure is 
highly correlated with the Lyapunov exponent even for very 
small data-lengths, as low as L = 50. 

Future work would be to investigate the effect of various 
kinds of noise (corrupting the time series) on the complex- 
ity measure N. We have reasons to believe that N would 
be robust to noise to some extent since we are working on 
the symbolic sequence. The new measure needs to be fur- 
ther tested for various dynamical systems (maps and flows) 
and stochastic time series of different distributions, and to 
non-uniform bin structures. The data compression aspect of 
NSRPS needs to the thoroughly investigated, especially for 
compressing chaotic sequences which are otherwise incom- 
pressible by standard techniques. 
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